ABSTRACT

Four essential dimensions of a performance test are
detailed: directness of test method, type of criterion,
standardization of conditions, and objectivity of scoring. For
simplicity these factors are described as if each were dichotomous,
when in actuality each is a continuum: a test method may be more or
less direct, conditions more or less standardized. Moreover, as shown
here, the dimensions are depicted as independent, when in practice
they are not; for instance, indirect methods of testing are often
used to attain objective scoring, and process criteria to achieve
standardized conditions. Nevertheless, this simple framework provides
a useful analytic tool for developers and users of performance tests.
It can guide the development of a test, or be used after the fact to
identify weaknesses in existing tests. More generally it defines
problem areas confronting the performance testing
practitioner, problem areas which must be addressed by research and
creative development work if performance tests are to be used
reliably and validly. (Author/BC)
ESSENTIAL DIMENSIONS OF PERFORMANCE TESTS



William C. Osborn



My remarks today are based on a conceptual analysis of factors which constrain the
development of valid and reliable performance tests. As one who for several years has
been involved in the development of performance tests, I am particularly attuned to the
practical problems encountered in trying to provide what might be termed efficient tests:
that is, tests which are valid and reliable, but also which are usable in the very real sense
of evaluating the proficiency of numbers of people at minimum cost in time and
resources. It is this tradeoff, test quality versus administrative economy, that lies at the
heart of the performance testing problem.

Although performance tests have other purposes, they are used chiefly in evaluating
training outcomes. Having received training on a job task (or tasks), a trainee is normally
required to demonstrate proficiency on the task before he is advanced to the next stage
of training, or ultimately, out of training and onto the job. The development and use
of such tests would seem to be straightforward: the job-relevant conditions for task
performance are created and an acceptable criterion of performance defined. Then the
trainee is asked to perform, and his performance is evaluated against the established
criterion. Unfortunately, the nature of certain types of job tasks, together with time and
cost constraints, often creates problems for the test developer. In circumventing these
problems he frequently resorts to simplistic test procedures of questionable reliability
or validity. More grave, however, is the fact that such compromises so frequently occur,
apparently either because of inadequate regard for the price one pays in diminishing
reliability and validity, or because of a lack of awareness of alternate approaches.

My objective today is to set forth, in a simple conceptual framework, what I see to
be the essential dimensions of a performance test: essential in the sense that they
comprise the key practical factors in achieving test reliability and validity. Within this
framework I will identify the more common shortcomings of performance tests and then
suggest, where I can, possible directions for improvement.

One final caveat before going on: the descriptive model that I will discuss is
limited to test development for individual tasks and does not touch on other aspects
of reliability and validity, such as sampling of the job task domain or replications of
test performance, which pertain to testing on an aggregate of tasks or an entire job.

TEST METHOD

The first critical dimension of a performance test to be considered pertains to the
directness or relevance of what I will call the method of testing. A test method is
relevant or direct if it evokes a performance that is the same as that specified in the actual
job task. The scope and fidelity of actual job or life conditions presented, and the realism
of the response medium used, thus determine the directness of test method. In a training
or other performance assessment setting, limited resources often prevent a direct task
enactment approach to testing. Indirect methods are often used which involve simulation
of task conditions or which require only partial task performance. These commonly
result in testing on only part of the task, usually the more testable part. Paper-and-pencil
knowledge tests on tasks with both knowledge and skill requirements represent the most
flagrant example of indirect test method. Tests of job knowledge are relatively
economical and have exceptional psychometric properties. Yet we would not for a moment
consider licensing a man to fly a plane or drive a car merely on the basis of a knowledge
test. The reason for this is obvious. But why then, in other job or job task areas, do we
tend to accept job knowledge as a valid measure of performance capability? As indicated,
the chief reason is cost. A performance test seeks to present the real work environment
with all its cues, then elicit the actual job behavior as directly as possible. Such a
representation of the real world is expensive. Training and personnel managers tend to think
performance tests require too much in the way of equipment, personnel and time to
justify their use. But to insist that a test of job knowledge is the only alternative, I
believe, reflects a false dilemma.

For a given job task several alternate test methods are potentially available. These
will lie between an expensive but fully relevant performance test, on the one hand, and
a relatively inexpensive but marginally valid knowledge test, on the other. Elsewhere
I have described an approach to devising alternate test methods: an approach based on
the concepts of simulation and task-element sampling. Tests resulting from the approach
I have collectively termed Synthetic Performance Tests. The intention is to connote a
process of synthesis by which the substructure of a job task is used as the basis for
selectively constructing alternate forms of a test, each representing (at least theoretically)
a more or less optimal blend of validity and feasibility. In some cases this may be
achieved through simulation; that is, by substituting for stimuli in either the task display
or the surround, or by requiring a substitute response. In other cases, efficient tests
may be created by testing on a subset of task elements, regardless of whether simulation
is used or not. Thus, synthetically generated alternatives to fully relevant performance
tests may vary in two major dimensions, fidelity and scope.

For example, consider an electronic troubleshooting task. Knowing the correct test
sequence for isolating a faulty equipment component is only part of the task. Among
other task elements the troubleshooter must also be able to place the test set in operation,
establish a good connection at the test points, and correctly interpret the test readouts.
Can this type of job task be adequately, that is, validly, tested with the traditional,
verbally formatted test of job knowledge? I would say no. In fact, experience may
reveal that, on the job, the most frequent cause of faulty troubleshooting is the inability
of the troubleshooter to establish good connections at the test points, an essentially
physical or manipulative element in the task performance. So, assuming the test
developer cannot afford the luxury of a direct, hands-on method of testing, the important
thing is that he does not immediately revert to the typical knowledge test. He should
use his inventiveness in devising alternate test methods that will call for the
demonstration of behavior that is as similar as possible to that actually required in task performance.
Pictorial, graphic, or even low cost three dimensional simulators should be considered. He
may then assess the relevance of these synthetic options by checking the breadth and
criticality of task elements that are tapped by a particular method.

Only in this way, it seems to me, can test developers arrive at economical methods
of proficiency testing while maintaining an acceptable level of content validity.
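The check on breadth and criticality of task elements can be made concrete with a small calculation. The following is only a sketch in Python under stated assumptions: the element names follow the troubleshooting example, but the criticality weights, the list of candidate methods, and the function names `coverage` and `rank_methods` are all hypothetical and serve illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskElement:
    name: str
    criticality: float  # hypothetical weight; higher means more critical to task success


def coverage(elements, tapped):
    """Return (breadth, weighted) coverage of the elements a test method taps."""
    tapped = set(tapped)
    hit = [e for e in elements if e.name in tapped]
    breadth = len(hit) / len(elements)
    weighted = sum(e.criticality for e in hit) / sum(e.criticality for e in elements)
    return breadth, weighted


# Illustrative elements of the troubleshooting task; the weights are invented.
ELEMENTS = [
    TaskElement("select test sequence", 3.0),
    TaskElement("operate test set", 2.0),
    TaskElement("connect at test points", 4.0),
    TaskElement("interpret readouts", 1.0),
]

# Hypothetical alternate methods and the elements each one is assumed to tap.
METHODS = {
    "knowledge test": ["select test sequence"],
    "pictorial simulation": ["select test sequence", "interpret readouts"],
    "low-cost 3-D simulator": [
        "select test sequence",
        "operate test set",
        "connect at test points",
    ],
    "hands-on test": [e.name for e in ELEMENTS],
}


def rank_methods(elements=ELEMENTS, methods=METHODS):
    """Rank methods by criticality-weighted coverage, highest first."""
    scored = {m: coverage(elements, t) for m, t in methods.items()}
    return sorted(scored.items(), key=lambda kv: kv[1][1], reverse=True)


if __name__ == "__main__":
    for method, (breadth, weighted) in rank_methods():
        print(f"{method:24s} breadth={breadth:.2f} weighted={weighted:.2f}")
```

A ranking of this kind says nothing about cost; it only shows how much of the task, and how much of its critical content, each alternative would leave untested.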

TEST CRITERION

Now let me turn to a second dimension of performance tests, that of test criterion.
All tasks have both a product (outcome) and process (steps in task performance).
Product measurement, however, is of overriding importance in certifying a person's
achievement on a job task, and failure to include it as the principal criterion may
severely limit test validity. Although it may safely be said that every task has a purpose,
the fact of the matter is that in practice a great many performance tests are used which
employ process measurement only in evaluating a person's job readiness.

Before looking more closely at why process measures are so widely substituted for
measures of task product we must consider three types of tasks. First there are tasks in
which the product and the process are one and the same; that is, the product is a process.
These tasks are few, and normally are found among those which serve an aesthetic
purpose, such as springboard diving, dancing, playing a musical composition. Here we see that
the outcome or product of the task is no more or less than the correct execution of steps
in task performance; that is, the process. A second type of task is that in which the
product necessarily follows from the process. Fixed procedure tasks typically fall in this
category. Troubleshooting an electrical circuit, balancing a checkbook, changing a tire
are examples. In tasks of this type the procedural steps are known, observable and
comprise the necessary and sufficient conditions for task outcome; so if the process is
correctly executed, task product necessarily follows.

For these first two types of tasks it is not particularly important whether process
or product measurement is used. But for a third type it is. This is the type in which
the product is less than fully predictable from the process, a circumstance which occurs
either because we are unable to fully specify the necessary and sufficient steps in task
performance, or because we cannot or do not accurately measure them. In spite of the
obvious importance of product measurement for tasks in this latter category, in practice
performance tests often do not focus on product. And the reasons generally stem from
practical considerations in which the measurement of task product is viewed as too costly,
too dangerous, or for other reasons simply too impractical. For example, in a first aid
task involving controlling the bleeding from an external wound, the test developer would
probably be limited to requiring demonstration of task process; observation of the actual
task product, restriction of blood flow, would presumably not be possible, for obvious
reasons. Other situations are less understandable. If any of you are involved in the field
of instructor training, you may have observed that a student instructor is evaluated on the
basis of such process factors as: "had a well organized lesson plan," "used visual aids
effectively," "had good eye contact," "had good voice projection," "covered all points
in the lesson plan," etc. Although clearly the product of instruction is student learning,
I believe it is seldom, if ever, used as the criterion for qualifying an instructor, probably
because it would involve a more time consuming method of evaluation.

I'm sure we could all testify to other instances in which product measurement is
not used. Some of these are justified by cost or safety considerations, but others are
not. It seems to me that test developers often fail to see the importance of measuring
task outcome; or perhaps they merely slight its importance when faced with practical
limitations in its measurement. The overriding question that a test designer should ask
himself in this situation is, "If I use only a process measure to test a person's
achievement on a task, how certain can I be from this process score that the person would also
be able to effect the product or outcome of the task?" Where the degree of certainty
is substantially less than that to be expected from normal measurement error, the test
designer should pause and reconsider ways in which time and resource limitations can be
compromised in achieving at least an approximation to product measurement.
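One rough way to put a number on that question, where a tryout group can be measured on both process and product, is to estimate the proportion of those passing the process measure who also achieve the product. The Python sketch below assumes such paired tryout records exist; the function name `product_given_process` and the data are invented for illustration and do not come from any actual test.

```python
def product_given_process(records):
    """Estimate P(product achieved | process passed) from paired outcomes.

    records: iterable of (process_passed, product_achieved) booleans,
             one pair per person tested with both measures.
    Returns None when nobody passed the process measure.
    """
    outcomes = [product for process, product in records if process]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


# Invented tryout data: 20 people passed the process measure, 12 of whom
# also achieved the task product; 5 failed both.
TRYOUT = [(True, True)] * 12 + [(True, False)] * 8 + [(False, False)] * 5

if __name__ == "__main__":
    print(product_given_process(TRYOUT))
```

A low proportion in such a tryout would be one signal that the process score alone is a weak basis for certifying achievement on the task.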

TEST CONDITIONS 

Now, let's look at a third dimension of performance tests, that of standardization
of conditions under which a test is administered. This is an important step in achieving
test reliability. Indeed, the very essence of any proficiency measure which professes to
be a test is that of standardized conditions. This requirement is familiar to test developers
and is therefore less often violated. An effort is normally made to maintain test
instructions, materials, tools, and other environmental factors as nearly constant as possible from
one test administration to the next. However, I would like to call to your attention one
particular class of tasks which is particularly troublesome in this regard: tasks involving
interpersonal behavior. Here, another person or group of people represents an important
part of the environment to be controlled, that is, standardized, from one test
administration to the next. Examples are seen in such areas as counseling, salesmanship,
personnel management, or in something like hand-to-hand combat. Tasks in these areas all
entail other people as part of the task-relevant conditions; and obviously people are
difficult to standardize. If you were interested in assessing a policeman's ability to properly
subdue an unarmed but hostile suspect, what would your performance test be like? And
how would you insure that test conditions were standardized over all policemen to be
tested? The same question might be asked in relation to assessing a would-be supervisor's
ability to persuade a worker to perform some difficult or unpleasant task.

Unfortunately, I know of no easy solution to this problem. Probably, the direction
that test designers should take is toward greater use of the well trained, "standardized
other" in controlled role-playing situations. In any case, the product in these kinds of
tasks is some defined, observable change in that task-relevant "other." And, here, greater
effort should be made to avoid settling too quickly for some probably irrelevant measure
of task process.

TEST SCORING 

The fourth and final dimension essential to performance tests is that of test scoring.
Scoring protocols impact primarily on reliability, but if grossly mishandled in test design,
as I will point out in a moment, they may also jeopardize test validity. Scoring procedures
involve translating an observed test outcome into an objective pass-fail score. Such
procedures should be structured so that only the more reliable perceptual skills are used; that
is, the scoring activity should be reduced to one of matching or comparing the test response
with some model of correct response. Unfortunately, responses in many test situations
seemingly cannot be judged in this "either-or" fashion, but require a "more-or-less" type
of judgment. When this occurs the test developer should not, as is sometimes done, escape
by using a test method that yields a more measurable outcome, because test validity may
suffer. Rather, he should remain with the task-relevant response and strive to break it
down into elements so that comparative judgments can be made more easily by a scorer.
A familiar illustration of what I mean is seen in typical programs of knowledge testing.
The pervasive multiple-choice test yields responses which can be scored with maximum
reliability. Obviously, scorers have little difficulty in matching a selected response
alternative with that which is keyed as correct by the test developer. The scoring of essay tests,
on the other hand, has traditionally presented reliability problems. Yet in spite of the
scoring problems inherent in essay testing, the competent test developer would not resort
to multiple-choice testing on knowledge tasks demanding recall or generation of material
merely to achieve greater scorer reliability. Normally he would provide a model response
in the form of an exhaustive list of the critical elements of an acceptable essay response,
the presence of which can be judged with relative objectivity by a qualified and earnest
scorer.
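Scoring against a model list of critical elements reduces each judgment to "present" or "absent," which is simple enough to sketch in a few lines of Python. This is an illustration under assumptions, not a prescribed procedure: the element names, the pass cutoff, and the functions `score_response` and `percent_agreement` are hypothetical.

```python
def score_response(observed, critical_elements, required=None):
    """Score a response against a model list of critical elements.

    observed: set of elements the scorer judged present in the response.
    required: how many critical elements must be present to pass
              (defaults to all of them; any other cutoff is an assumption).
    Returns (passed, missing elements in model order).
    """
    if required is None:
        required = len(critical_elements)
    present = [e for e in critical_elements if e in observed]
    missing = [e for e in critical_elements if e not in observed]
    return len(present) >= required, missing


def percent_agreement(scorer_a, scorer_b, critical_elements):
    """Share of critical elements on which two scorers made the same judgment."""
    same = sum((e in scorer_a) == (e in scorer_b) for e in critical_elements)
    return same / len(critical_elements)


# Hypothetical model response for an essay item.
MODEL = [
    "defines the term",
    "gives an example",
    "states a limitation",
    "draws a conclusion",
]

if __name__ == "__main__":
    scorer_a = {"defines the term", "gives an example", "draws a conclusion"}
    scorer_b = set(MODEL)
    print(score_response(scorer_a, MODEL))
    print(percent_agreement(scorer_a, scorer_b, MODEL))
```

Element-by-element agreement between two scorers gives a quick, if crude, indication of how objective the checklist is in practice.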

This same thinking applies to the development of scoring protocols for performance
tests if these tests are to produce reliable results. The subjectivity with which many
task performances are customarily scored could be substantially reduced, it seems to me,
through wider use of what may be termed scoring templates. Where the model response
on a test of marksmanship is defined as a hole in the bullseye, it is relatively easy for
the scorer to judge the acceptability of the response made by the rifleman. This is
because the concentric circles normally marked on a target act as a kind of simple
template which enhances the ease and objectivity of scorer judgments as to the nearness
of a hit to the center of the target. Templates could be applied equally well in scoring
other tests. For example, tasks mentioned earlier in which the outcome is a process
are often troublesome to assess reliably. It would appear that performances such as
springboard diving or gymnastic exercises could be more objectively scored if the
outcomes were filmed and figural templates overlaid on key frames to assess the accuracy
of the performer at those critical points in the response. Similarly, in evaluating the
performance of a music student, recordings of selected renditions could be analyzed at
the scorer's leisure, perhaps with the aid of auditory templates such as a metronome to
measure beat or comparative tones to assess tonal quality. For these particular tasks,
or for that matter any task in which the product is transient, the added cost in recording
the product for scoring later would probably be offset by savings in scoring costs; that
is, the more objective approach to scoring would very likely preclude the usual
requirements for a panel of expert evaluators. But more importantly the scorer would not
be constrained by real time, and could function at a place and time and rate of his
choosing, using prepared templates to further the objectivity of his judgments.

Thus we have what I consider to be the four essential dimensions of a performance
test: directness of test method, type of criterion, standardization of conditions, and
objectivity of scoring. For simplicity these factors have been described as if each were
dichotomous, when in actuality each is a continuum; a test method may be more or
less direct, conditions more or less standardized. Moreover, as shown here, the
dimensions are depicted as independent, when in practice they are not; for instance, indirect
methods of testing are often used to attain objective scoring, and process criteria to
achieve standardized conditions.

Nevertheless, this simple framework provides a useful analytic tool for developers
and users of performance tests. It can guide the development of a test, or be used after
the fact to identify weaknesses in existing tests. More generally it identifies problem
areas confronting the performance testing practitioner, problem areas which must be
addressed by research and creative development work if performance tests are to be used
validly and reliably.